1 Section 1: Introduction

We address the data science problem using responses from the 2021 Kaggle survey.

2 Section 2: Data Cleaning

3 Section 3: The Analysis and Results

  • The analysis is organized as a list of questions, addressed one by one.

3.1 Q1: Job Titles Distributions

What percentage of the survey respondents are working under these job titles?

3.2 Q2: Salaries

What are the statistics on salaries for these job titles?

The US federal minimum wage is $7.25 per hour. Multiplying by an 8-hour workday and a typical \(260\) workdays a year gives an annualized full-time income of:

\[7.25 \frac{\$}{\text{hour}} \cdot 8 \frac{\text{hour}}{\text{day}} \cdot 260\frac{\text{day}}{\text{year}} = \$15{,}080/\text{year}.\]
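As a quick sanity check, the annualization can be reproduced in R (assuming a standard 8-hour workday):

```r
# Annualized full-time income at the US federal minimum wage,
# assuming 8-hour workdays and 260 workdays per year
hourly_wage <- 7.25
hours_per_day <- 8
workdays_per_year <- 260

annual_wage <- hourly_wage * hours_per_day * workdays_per_year
annual_wage
# [1] 15080
```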

The new categories* for salaries are:

  • poverty: below 40,000
  • low: 40,000 to 79,999
  • medium: 80,000 to 124,999
  • high: 125,000 to 199,999
  • very high: 200,000 to 499,999
  • highest: 500,000 and above

*loosely based on US federal income tax brackets.

3.3 Q3: Required Education Levels

What levels of education are required for these job titles?

3.4 Q4: Gender Gap in Income

Is there a significant income gap between genders for these jobs?

3.5 Q5: Skillset vs Income

What is the typical skill set for these jobs? How does it affect the pay rate?

3.6 Q6: Correlation between Industry and Job

Is there a correlation between industry and the demand for these jobs?

## 
##  Pearson's Chi-squared test
## 
## data:  table(industry.dat$Q5, industry.dat$Q20)
## X-squared = 108.6, df = 25, p-value = 2.153e-12

3.7 Q7: Languages and IDEs

What programming languages and IDEs do they use?

Survey questions Q7 (daily-used programming language), Q9 (IDE).

3.8 Q8: Learning Sources

Where do they get and share the knowledge?

Survey questions Q39 (share and deploy), Q40 (learning resources), Q42 (Media sources).

3.10 Q10: Cloud Computing Platform Preferences

As the demand for computational power grows along with the volume of data in the data science industry, cloud computing is a topic that no data science practitioner can avoid.

Specifically, will a user’s preference for a cloud computing platform affect his or her preference for other tools? For example, we want to know whether a dedicated AWS EC2 user will actually prefer AWS S3 over competing products.

Survey question Q29-A: computing products (Part_1)
Survey question Q30: storage (Part_3, Part_4)
Survey question Q31-A: ML products (Part_1)

\[\chi^2=\sum\frac{(O_i-E_i)^2}{E_i}\]
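To make the statistic concrete, here is the Pearson statistic computed from its definition on a small made-up 2x2 table, checked against R's built-in chisq.test() (with Yates' correction turned off so the two agree):

```r
# Pearson's chi-squared statistic from first principles on a toy table
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
chi_sq <- sum((obs - expected)^2 / expected)

builtin <- unname(chisq.test(obs, correct = FALSE)$statistic)
c(manual = chi_sq, builtin = builtin)
# both equal 16.667 (df = 1)
```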

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  aws_user$ec2 and aws_user$s3
## X-squared = 1532, df = 1, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  aws_user$ec2 and aws_user$efs
## X-squared = 637.99, df = 1, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  aws_user$ec2 and aws_user$sagemaker
## X-squared = 566.34, df = 1, p-value < 2.2e-16
  • All three tests reject independence: among AWS users, EC2 usage is strongly associated with usage of S3, EFS, and SageMaker (all p-values below 2.2e-16).

3.11 Q11: AWS, Azure and GCP

What is the overall AWS usage percentage among DS practitioners? Is it the same for Google Cloud? What about Microsoft Azure?

\(H_0: p_A=p_B\), \(H_a:p_A\not=p_B\), where \(A\) and \(B\) can each be AWS, Azure, or GCP, and \(n_A\), \(n_B\) are the sample sizes of groups \(A\) and \(B\) respectively.

The test statistic (z-statistic) can be calculated as follows, where \(p\) is the pooled proportion of the two groups:

\[z=\frac{p_A-p_B}{\sqrt{p(1-p)/n_A+p(1-p)/n_B}}\]
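The pooled z-statistic can be written out directly and checked against prop.test(), which reports \(z^2\) as its X-squared statistic; the counts below are hypothetical, not taken from the survey:

```r
# Two-sample test for equality of proportions with pooled variance
# (counts are hypothetical illustrations)
x_a <- 250; n_a <- 1000   # e.g., AWS users among respondents
x_b <- 150; n_b <- 1000   # e.g., Azure users among respondents

p_a <- x_a / n_a
p_b <- x_b / n_b
p   <- (x_a + x_b) / (n_a + n_b)   # pooled proportion

z <- (p_a - p_b) / sqrt(p * (1 - p) / n_a + p * (1 - p) / n_b)
p_value <- 2 * pnorm(-abs(z))

# prop.test() without continuity correction gives X-squared = z^2
pt <- prop.test(c(x_a, x_b), c(n_a, n_b), correct = FALSE)
all.equal(unname(pt$statistic), z^2)
# TRUE
```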

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 81.285, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07606175 0.11865523
## sample estimates:
##    prop 1    prop 2 
## 0.2377358 0.1403774
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 1.8823, df = 1, p-value = 0.1701
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.005495458  0.031910552
## sample estimates:
##    prop 1    prop 2 
## 0.1403774 0.1271698
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 107.85, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1315249 -0.0896072
## sample estimates:
##    prop 1    prop 2 
## 0.1271698 0.2377358
  • AWS usage (23.8%) is significantly higher than both Azure (14.0%) and GCP (12.7%); both comparisons have p-values below 2.2e-16.
  • The 95% confidence interval for the Azure vs GCP difference, (-0.0055, 0.0319), contains zero, so those two usage rates are not significantly different (p = 0.17).

3.12 Q12: Linear Regression on Salary

Honestly speaking, the best way to approach this question is to apply a logistic regression to the dependent variable salary, due to its categorical nature. However, in order to experiment with a linear model, we take the bold step of transforming salary back to a continuous variable by randomly sampling values within each pay bracket. Three languages typically used by data science practitioners are included as well: Python, R, and SQL.
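A minimal sketch of the back-transformation, with illustrative bracket bounds (the report's actual bounds come from parsing Q25): each respondent gets a value drawn uniformly within their bracket.

```r
# Sketch: recover a continuous salary from a categorical pay bracket
# by sampling uniformly between the bracket's bounds (bounds illustrative)
set.seed(511)

brackets <- data.frame(
  label = c("40,000-49,999", "50,000-59,999", "60,000-69,999"),
  lb    = c(40000, 50000, 60000),
  ub    = c(49999, 59999, 69999)
)

respondents <- data.frame(Q25 = c("40,000-49,999", "60,000-69,999",
                                  "50,000-59,999"))
respondents <- merge(respondents, brackets, by.x = "Q25", by.y = "label")
respondents$salary <- runif(nrow(respondents), respondents$lb, respondents$ub)
```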

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Start:  AIC=3.1
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + sql:r + sql:python + r:python + python:r:sql + 
##     job:sql + job:r + job:python + job:experience + job:education
## 
##                  Df Sum of Sq    RSS    AIC
## - job:sql        11     7.366 1498.5 -9.810
## - job:python     11     9.613 1500.7 -7.045
## - education:job  49    75.055 1566.2 -4.296
## - job:r          12    15.427 1506.5 -1.912
## - job:experience 65   105.598 1596.7 -0.661
## - python:r:sql    1     0.129 1491.2  1.259
## <none>                        1491.1  3.099
## - gender          4    33.081 1524.2 35.584
## - age_group      10    49.159 1540.3 42.943
## 
## Step:  AIC=-9.81
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + r:sql + python:sql + python:r + job:r + 
##     job:python + job:experience + education:job + python:r:sql
## 
##                  Df Sum of Sq    RSS      AIC
## - job:python     11    10.415 1508.9 -19.0305
## - education:job  49    74.735 1573.2 -18.0130
## - job:experience 66   104.636 1603.1 -17.2761
## - job:r          12    13.994 1512.5 -16.6593
## - python:r:sql    1     0.116 1498.6 -11.6670
## <none>                        1498.5  -9.8096
## - gender          4    33.324 1531.8  22.7705
## - age_group      10    49.541 1548.0  30.2010
## 
## Step:  AIC=-19.03
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + r:sql + python:sql + python:r + job:r + 
##     job:experience + education:job + python:r:sql
## 
##                  Df Sum of Sq    RSS     AIC
## - education:job  49    71.194 1580.1 -31.970
## - job:r          12    14.428 1523.3 -25.473
## - python:r:sql    1     0.448 1509.3 -20.483
## <none>                        1508.9 -19.031
## - job:experience 67   115.338 1624.2 -17.131
## - gender          4    34.627 1543.5  14.831
## - age_group      10    49.253 1558.2  20.231
## 
## Step:  AIC=-31.97
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + r:sql + python:sql + python:r + job:r + 
##     job:experience + python:r:sql
## 
##                  Df Sum of Sq    RSS     AIC
## - job:experience 69   114.216 1694.3 -41.205
## - job:r          12    17.367 1597.5 -35.802
## - python:r:sql    1     0.158 1580.2 -33.786
## <none>                        1580.1 -31.970
## - education       6    12.916 1593.0 -28.949
## - gender          4    35.101 1615.2   0.567
## - age_group      10    45.898 1626.0   0.859
## 
## Step:  AIC=-41.21
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + r:sql + python:sql + python:r + job:r + 
##     python:r:sql
## 
##                Df Sum of Sq    RSS     AIC
## - job:r        12    18.759 1713.1 -44.890
## - python:r:sql  1     0.075 1694.4 -43.124
## <none>                      1694.3 -41.205
## - education     6    14.747 1709.0 -37.216
## - gender        4    40.921 1735.2  -5.174
## - age_group    10    62.087 1756.4   5.194
## - experience    6    76.442 1770.8  28.212
## 
## Step:  AIC=-44.89
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + r:sql + python:sql + python:r + python:r:sql
## 
##                Df Sum of Sq    RSS     AIC
## - python:r:sql  1     0.136 1713.2 -46.744
## <none>                      1713.1 -44.890
## - education     6    13.016 1726.1 -42.925
## - gender        4    37.809 1750.9 -12.612
## - age_group    10    63.458 1776.5   2.220
## - job          12    79.981 1793.0  15.301
## - experience    6    73.691 1786.8  20.817
## 
## Step:  AIC=-46.74
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + r:sql + python:sql + python:r
## 
##              Df Sum of Sq    RSS     AIC
## - r:sql       1     0.023 1713.2 -48.719
## - python:r    1     1.249 1714.5 -47.399
## <none>                    1713.2 -46.744
## - education   6    12.952 1726.2 -44.848
## - python:sql  1     4.106 1717.3 -44.328
## - gender      4    37.739 1750.9 -14.542
## - age_group  10    63.863 1777.1   0.782
## - job        12    79.876 1793.1  13.332
## - experience  6    75.066 1788.3  20.377
## 
## Step:  AIC=-48.72
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + python:sql + python:r
## 
##              Df Sum of Sq    RSS     AIC
## - python:r    1     1.329 1714.5 -49.289
## <none>                    1713.2 -48.719
## - education   6    12.932 1726.2 -46.845
## - python:sql  1     4.113 1717.3 -46.295
## - gender      4    37.720 1750.9 -16.539
## - age_group  10    63.965 1777.2  -1.089
## - job        12    79.955 1793.2  11.436
## - experience  6    75.178 1788.4  18.515
## 
## Step:  AIC=-49.29
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + r + sql + python:sql
## 
##              Df Sum of Sq    RSS     AIC
## - r           1     0.588 1715.1 -50.656
## <none>                    1714.5 -49.289
## - education   6    13.274 1727.8 -47.060
## - python:sql  1     4.123 1718.7 -46.857
## - gender      4    38.204 1752.8 -16.630
## - age_group  10    64.559 1779.1  -1.094
## - job        12    79.665 1794.2  10.505
## - experience  6    78.185 1792.7  20.983
## 
## Step:  AIC=-50.66
## log(salary) ~ age_group + gender + education + job + experience + 
##     python + sql + python:sql
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    1715.1 -50.656
## - python:sql  1     4.322 1719.5 -48.013
## - education   6    14.190 1729.3 -47.455
## - gender      4    37.922 1753.1 -18.308
## - age_group  10    64.867 1780.0  -2.166
## - job        12    79.080 1794.2   8.508
## - experience  6    79.142 1794.3  20.572
## 
## Call:
## lm(formula = log(salary) ~ age_group + gender + education + job + 
##     experience + python + sql + python:sql, data = lm_dat)
## 
## Coefficients:
##                                                                (Intercept)  
##                                                                  10.200984  
##                                                             age_group22-24  
##                                                                   0.240567  
##                                                             age_group25-29  
##                                                                   0.485461  
##                                                             age_group30-34  
##                                                                   0.774225  
##                                                             age_group35-39  
##                                                                   0.823671  
##                                                             age_group40-44  
##                                                                   0.766916  
##                                                             age_group45-49  
##                                                                   0.799968  
##                                                             age_group50-54  
##                                                                   0.981098  
##                                                             age_group55-59  
##                                                                   0.787097  
##                                                             age_group60-69  
##                                                                   0.562732  
##                                                               age_group70+  
##                                                                   0.358895  
##                                                            genderNonbinary  
##                                                                  -0.559571  
##                                                    genderPrefer not to say  
##                                                                  -0.478275  
##                                              genderPrefer to self-describe  
##                                                                  -0.548671  
##                                                                genderWoman  
##                                                                  -0.312198  
##                                                   educationDoctoral degree  
##                                                                   0.209169  
##                                            educationI prefer not to answer  
##                                                                  -0.455451  
##                                                   educationMaster’s degree  
##                                                                   0.115979  
##                              educationNo formal education past high school  
##                                                                  -0.109712  
##                                            educationProfessional doctorate  
##                                                                   0.195027  
## educationSome college/university study without earning a bachelor’s degree  
##                                                                  -0.081503  
##                                                            jobData Analyst  
##                                                                   0.005641  
##                                                           jobData Engineer  
##                                                                   0.134129  
##                                                          jobData Scientist  
##                                                                   0.332849  
##                                                   jobDBA/Database Engineer  
##                                                                  -0.075810  
##                                            jobDeveloper Relations/Advocacy  
##                                                                   1.308019  
##                                               jobMachine Learning Engineer  
##                                                                   0.349711  
##                                                                   jobOther  
##                                                                   0.068385  
##                                                         jobProduct Manager  
##                                                                   0.769284  
##                                                 jobProgram/Project Manager  
##                                                                   0.404084  
##                                                      jobResearch Scientist  
##                                                                  -0.159111  
##                                                       jobSoftware Engineer  
##                                                                   0.179390  
##                                                            jobStatistician  
##                                                                  -0.340587  
##                                                        experience1-3 years  
##                                                                   0.085388  
##                                                      experience10-20 years  
##                                                                   0.645163  
##                                                        experience20+ years  
##                                                                   0.691455  
##                                                        experience3-5 years  
##                                                                   0.285039  
##                                                       experience5-10 years  
##                                                                   0.494777  
##                                        experienceI have never written code  
##                                                                  -0.130967  
##                                                                     python  
##                                                                  -0.130113  
##                                                                        sql  
##                                                                  -0.200599  
##                                                                 python:sql  
##                                                                   0.271468

Besides the summary() table, the analysis of variance table below shows each term's contribution to the model fit:

## Analysis of Variance Table
## 
## Response: log(salary)
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## age_group    10  160.14 16.0136 16.8339 < 2.2e-16 ***
## gender        4   71.08 17.7694 18.6796 4.666e-15 ***
## education     6   34.20  5.6995  5.9915 3.246e-06 ***
## job          12  128.81 10.7340 11.2839 < 2.2e-16 ***
## experience    6   83.35 13.8922 14.6039 2.415e-16 ***
## python        1    0.01  0.0077  0.0081   0.92829    
## sql           1    0.20  0.2040  0.2145   0.64334    
## python:sql    1    4.32  4.3217  4.5430   0.03319 *  
## Residuals  1803 1715.14  0.9513                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
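The stepwise trace above is the kind of output R's step() produces when doing backward elimination on AIC. A self-contained sketch on a built-in dataset (mtcars stands in for the report's lm_dat, and the formula is illustrative):

```r
# Backward stepwise selection on AIC, as in the trace above
full <- lm(mpg ~ wt + hp + qsec + wt:hp, data = mtcars)

# trace = 0 suppresses the per-step tables shown in the report
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```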

4 Section 4: Conclusions

  • Findings as bullet points:
  1. We don’t have a one-sentence conclusion to summarize all the findings above. However, what we learned from this project is more about…
  2. Gender inequality in pay persists: holding age, education, job title, and experience fixed, the fitted model estimates a coefficient of roughly -0.31 on log(salary) for women, i.e., about 27% lower expected salary.
  • Besides the loose collection of findings above, we believe that ML and cloud computing are the general trend of data science in the next couple of years.

  • There is more to discover in the survey data; a comparison with surveys from previous years would be a natural next step.

  • Limitations (from our proposal, last paragraph): Kaggle users are not the same population as data science practitioners, and the large share of students means the sample is not representative enough to reflect the reality of the industry.

5 Appendix: R scripts

# invalidate cache when the package version changes
knitr::opts_chunk$set(cache = TRUE,
                      echo = FALSE,
                      eval = TRUE,
                      tidy = FALSE,
                      warning = FALSE,
                      cache.extra = packageVersion('tidyverse'))
options(htmltools.dir.version = FALSE)

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, ggthemes, latex2exp, glue,
               hrbrthemes, plotly, stringr, DT, extrafont,
               tidymodels)

set.seed(511)

ggplot2::theme_set(
    theme_fivethirtyeight() +
    theme(
        text = element_text(family = "Roboto Condensed"),
        title = element_text(size = 14),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 10),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        panel.grid.minor.x = element_blank()
    )
)
Since the same code will be used in Problem 1-c, this part serves as the setup.
col_names <- names(read_csv(
  "data/kaggle_survey_2021_responses.csv",
  n_max=0))
dat <- read_csv(
  "data/kaggle_survey_2021_responses.csv",
  col_names = col_names, skip=2)

dat <- dat %>%
  filter(Q3=="United States of America" )

job.dat <- dat %>% 
    filter(Q5 %in% c("Data Analyst",
                     "Data Engineer",
                     "Data Scientist",
                     "Machine Learning Engineer",
                     "Software Engineer",
                     "Statistician",
                     "Student"))
Since the large data science problem can be divided into several smaller pieces, we decided to combine the analysis and interpretation sections, addressing and discussing the questions one by one.
The following questions are about data science career paths.
jtitle <- sort(table(dat$Q5), decreasing = TRUE) %>% 
    as.data.frame() %>%
    as_tibble()
jtitle <- rename(jtitle, `Job Title` = Var1)

ggplot(jtitle, aes(x="", y=Freq, fill=`Job Title`)) +
    geom_bar(stat="identity", width=1, color="white") +
    coord_polar("y", start=0) + 
    theme_void() 
# To change salary categories into FACTOR dtype with descending labels
poverty   <- c("$0-999", "1,000-1,999" , "2,000-2,999", "3,000-3,999", "4,000-4,999", "5,000-7,499", "7,500-9,999",
             "10,000-14,999", "15,000-19,999", "20,000-24,999", "25,000-29,999", "30,000-39,999") 
low       <- c("40,000-49,999", "50,000-59,999", "60,000-69,999", "70,000-79,999")
medium    <- c("80,000-89,999", "90,000-99,999", "100,000-124,999")
high      <- c("125,000-149,999", "150,000-199,999")
very_high <- c("200,000-249,999", "250,000-299,999", "300,000-499,999")
highest   <- c("$500,000-999,999", ">$1,000,000")

dat$Q25[dat$Q25 %in% poverty] <- "poverty"
dat$Q25[dat$Q25 %in% low] <- "low"
dat$Q25[dat$Q25 %in% medium] <- "medium"
dat$Q25[dat$Q25 %in% high] <- "high"
dat$Q25[dat$Q25 %in% very_high] <- "very high"
dat$Q25[dat$Q25 %in% highest] <- "highest"
dat$Q25 <- factor(dat$Q25, levels = c("poverty", "low", "medium", "high", "very high", "highest"), ordered = T)

data_side <- c("Data Scientist", "Data Analyst", "Business Analyst", "Data Engineer", "Statistician", "DBA/Database Engineer")
swe_side <- c("Software Engineer", "Machine Learning Engineer", "Program/Project Manager", "Product Manager") 
academia <- c("Student", "Other", "Research Scientist") 

dat[dat$Q5 %in% data_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot( aes(x=Q5, y=Q25, color = Q5)) +
    geom_count() + 
    ggtitle("Two-Way Salary Visualizations: Data-Oriented Jobs") +
    xlab("") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))


dat[dat$Q5 %in% swe_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot( aes(x=Q5, y=Q25, color = Q5)) +
    geom_count() + 
    ggtitle("Two-Way Salary Visualizations: Engineering-Oriented Jobs") +
    xlab("") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

dat[dat$Q5 %in% academia & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot( aes(x=Q5, y=Q25, color = Q5)) +
    geom_count() + 
    ggtitle("Two-Way Salary Visualizations: Academic Jobs and Others") +
    xlab("") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
skill.set <- job.dat %>% 
    select(c(Q5, starts_with("Q7_"), starts_with("Q9_"), 
             starts_with("Q12_"), starts_with("Q14_"),
             starts_with("Q16_"), starts_with("Q17_"),
             starts_with("Q18_"), starts_with("Q19_"))) %>%
    mutate(Total = "TotalHelper") %>%
    gather("fake_key", "skillset", -Q5, na.rm = T) %>%
    filter(!skillset %in% c("None", "Other")) %>%
    rename(title = Q5) %>%
    count(title, skillset) %>%
    group_by(title) %>%
    mutate(prop = round(n / max(n), 3)) %>%
    filter(prop >= 0.1 & skillset != "TotalHelper") %>%
    select(-n) %>%
    arrange(title, desc(prop))

datatable(skill.set, filter = 'top', width = 600)
industry.dat <- job.dat %>%
    filter(Q5 != "Student") %>%
    select(Q5, Q20, Q25) %>%
    filter(Q20 %in% c("Academics/Education", 
                      "Accounting/Finance", 
                      "Computers/Technology",
                      "Insurance/Risk Assessment",
                      "Medical/Pharmaceutical",
                      "Online Service/Internet-based Services")) %>%
    mutate(Q25 = str_remove_all(Q25, "[$,]")) %>%
    mutate(Q25 = str_replace(Q25, ">1000000", "1000000-2000000")) %>%
    separate(Q25, into = c("salary_lb", "salary_ub"), sep = "-") %>%
    mutate(salary_lb = as.numeric(salary_lb)) %>%
    mutate(salary_ub = as.numeric(salary_ub))

p <- industry.dat %>% 
    count(Q5, Q20) %>%
    mutate(Q20 = fct_reorder(Q20, n, .fun="sum")) %>%
    rename(title=Q5, Industry=Q20, count=n) %>%
    ggplot(aes(x=title, y=count)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    facet_wrap(~ Industry) +
    labs(
        title = "Users' work industry",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p)
chisq.test(table(industry.dat$Q5, industry.dat$Q20))
industry.dat %>% 
    mutate(Q20 = fct_reorder(Q20, salary_lb, .fun='length')) %>%
    ggplot(aes(x=Q20, y=salary_lb)) +
    geom_boxplot() +
    coord_flip() +
    facet_wrap(~ Q5) +
    labs(
        title = "Users' salary vs industry",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
The following questions are mainly about data science skills, tools, and technologies.
programming <- job.dat %>% 
    select(c(Q5, starts_with("Q7_"))) %>%
    gather("fake_key", "language", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    filter(!language %in% c("None", "Other")) %>%
    count(title, language, .drop = FALSE) %>% 
    complete(title, language) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- programming %>% 
    mutate(text = paste0("Language: ", language, "\n", 
                         "Job title: ", title, "\n", 
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(language, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite programming language",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
ide <- job.dat %>% 
    select(c(Q5, starts_with("Q9_"))) %>%
    gather("fake_key", "IDE", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    mutate(IDE = case_when(
        IDE == "Visual Studio Code (VSCode)" ~ "VSCode",
        IDE == "Jupyter (JupyterLab, Jupyter Notebooks, etc)" ~ "Jupyter Notebook",
        TRUE ~ IDE
    )) %>%
    filter(!IDE %in% c("None", "Other")) %>%
    count(title, IDE, .drop = FALSE) %>% 
    complete(title, IDE) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- ide %>% 
    mutate(text = paste0("IDE: ", IDE, "\n", 
                         "Job title: ", title, "\n", 
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(IDE, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite IDE",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
learning_platform <- job.dat %>%
    select(c(Q5, starts_with("Q40_"))) %>%
    gather("fake_key", "learning", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    mutate(learning = case_when(
        learning == "Cloud-certification programs (direct from AWS, Azure, GCP, or similar)" ~ "Cloud-certif Programs",
        learning == "University Courses (resulting in a university degree)" ~ "University",
        TRUE ~ learning
    )) %>%
    filter(!learning %in% c("None", "Other")) %>%
    count(title, learning, .drop = FALSE) %>%
    complete(title, learning) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- learning_platform %>%
    mutate(text = paste0("Platform: ", learning, "\n",
                         "Job title: ", title, "\n",
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(learning, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite learning platforms",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
share_deploy <- job.dat %>% 
    select(c(Q5, starts_with("Q39_"))) %>%
    gather("fake_key", "share", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    mutate(share = case_when(
        share == "I do not share my work publicly" ~ "\'PRIVATE\'",
        TRUE ~ share
    )) %>%
    filter(!share %in% c("Other")) %>%
    count(title, share, .drop = FALSE) %>% 
    complete(title, share) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- share_deploy %>% 
    mutate(text = paste0("Platform: ", share, "\n", 
                         "Job title: ", title, "\n",
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(share, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite share platforms",
        x = "",
        y = "",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
media_source <- job.dat %>% 
    select(c(Q5, starts_with("Q42_"))) %>%
    gather("fake_key", "media", -Q5, na.rm = TRUE) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    filter(!media %in% c("None", "Other")) %>%
    count(title, media, .drop = FALSE) %>% 
    complete(title, media) %>%          # keep zero-count title/media pairs
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n)) %>%    # within-title proportions
    # keep only the platform name; drop the parenthesized detail
    separate(media, into = c("media", "media_suffix"), sep = " \\(")

p <- media_source %>% 
    mutate(text = paste0("Platform: ", media, "\n", 
                         "Job title: ", title, "\n", 
                         "Count: ", n, "\n", 
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(media, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite media source",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
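All three heatmaps above shade by `prop`, the within-title proportion produced by `group_by(title) %>% mutate(prop = prop.table(n))`, so cells compare platforms within a job title rather than across titles. A minimal base-R illustration of that normalization (toy counts; the titles and platforms are made up):

```r
# Toy two-way table of counts: rows = job titles, columns = platforms
tab <- matrix(c(20, 30, 50,
                10, 70, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(title    = c("Data Scientist", "ML Engineer"),
                              platform = c("GitHub", "Kaggle", "Blog")))
# margin = 1 divides each row by its row total, so proportions sum to 1 per title
props <- prop.table(tab, margin = 1)
props["Data Scientist", "GitHub"]  # 20 / (20 + 30 + 50) = 0.2
rowSums(props)                     # each row sums to 1
```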
# Q9

# visualization libraries (Q14); drop the catch-all columns
viz_lib <- dat %>%
  select(Q5, starts_with("Q14")) %>%
  select(-c(Q14_Part_11, Q14_OTHER))

viz_lib <- viz_lib %>%
  pivot_longer(cols=starts_with("Q14")) %>%
  select(-name) %>%
  drop_na() %>%
  filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
                   "Developer Relations/Advocacy",
                   "Currently not employed",
                   "Statistician", "Product Manager")))

ggplot(viz_lib) +
  geom_bar(aes(y = value, fill = value)) +
  facet_wrap(~ Q5) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Data science practitioners' favorite viz libraries",
    x = "",
    y = "",
    caption = glue("Author: celeritasML
                   Source: Kaggle")
  ) +
  theme(
        axis.ticks.x = element_line(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
# ML frameworks (Q16); drop the catch-all columns
ml_lib <- dat %>%
  select(Q5, starts_with("Q16")) %>%
  select(-c(Q16_Part_17, Q16_OTHER))

ml_lib <- ml_lib %>%
  pivot_longer(cols=starts_with("Q16")) %>%
  select(-name) %>%
  drop_na() %>%
  filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
                   "Developer Relations/Advocacy",
                   "Currently not employed",
                   "Statistician", "Product Manager")))

# ten most common ML libraries overall (slice_max keeps ties)
top_10_ml <- ml_lib %>%
  count(value, name = "count") %>%
  slice_max(count, n = 10)

ml_lib <- ml_lib %>%
  filter(value %in% top_10_ml$value)

ggplot(ml_lib) +
  geom_bar(aes(y = value, fill = value)) +
  facet_wrap(~ Q5) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Data science practitioners' favorite ML libraries",
    x = "",
    y = "",
    caption = glue("Author: celeritasML
                   Source: Kaggle")
  ) +
  theme(
        axis.ticks.x = element_line(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
# Contingency tables (with Sum margins) behind the two faceted charts
viz_lib %>%
  table() %>%
  addmargins() %>%
  print()

ml_lib %>%
  table() %>%
  addmargins() %>%
  print()
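`table()` on the two-column data frames cross-tabulates title against library, and `addmargins()` appends `Sum` row and column totals. A self-contained toy example (hypothetical titles and libraries):

```r
# Cross-tabulate two categorical columns and append marginal totals
df <- data.frame(title = c("DS", "DS", "MLE", "MLE", "MLE"),
                 lib   = c("ggplot2", "Matplotlib", "Matplotlib",
                           "ggplot2", "ggplot2"))
tab <- table(df)     # 2 x 2 contingency table of counts
addmargins(tab)      # adds a "Sum" row and column; grand total is 5
```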
# Q10

# 0/1 indicators: the respondent ticked the service (non-NA answer)
aws_user <- tibble(
  ec2 = dat$Q29_A_Part_1,
  s3 = dat$Q30_A_Part_3,
  efs = dat$Q30_A_Part_4,
  sagemaker = dat$Q31_A_Part_1,
  redshift = dat$Q32_A_Part_11,
  aurora = dat$Q32_A_Part_12,
  rds = dat$Q32_A_Part_13,
  dynamodb = dat$Q32_A_Part_14
) %>%
  mutate(across(everything(), ~ if_else(is.na(.x), 0, 1)))
chisq.test(aws_user$ec2, aws_user$s3)
chisq.test(aws_user$ec2, aws_user$efs)
chisq.test(aws_user$ec2, aws_user$sagemaker)
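Each test above asks whether use of one AWS service is independent of use of another, treating the 0/1 indicators as two-level factors. A sketch with simulated indicators (toy data, not the survey; the association is built in so the test has something to detect):

```r
set.seed(1)
n <- 500
ec2 <- rbinom(n, 1, 0.4)                 # 0/1 usage indicator
# s3 usage depends on ec2 usage, so the two are associated by construction
s3 <- ifelse(ec2 == 1, rbinom(n, 1, 0.8), rbinom(n, 1, 0.3))
res <- chisq.test(ec2, s3)               # builds the 2x2 table internally
res$p.value                              # very small: independence is rejected
```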
# Q11

cloud_comp <- tibble(
  aws_usage = dat$Q27_A_Part_1,
  azure_usage = dat$Q27_A_Part_2,
  gcp_usage = dat$Q27_A_Part_3
) %>%
  # TRUE if the respondent reported using the platform (answered the item)
  mutate(across(everything(), ~ !is.na(.x)))
prop.test(c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
prop.test(c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
prop.test(c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
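Each `prop.test()` call above is a two-sample test of equal proportions (with Yates' continuity correction) on the usage shares of a pair of platforms. A toy run with made-up counts:

```r
# 1200 of 5000 respondents use platform A, 900 of 5000 use platform B
res <- prop.test(c(1200, 900), c(5000, 5000),
                 alternative = "two.sided", correct = TRUE)
res$estimate   # sample proportions: 0.24 and 0.18
res$conf.int   # confidence interval for the difference in proportions
```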
# Q12

lm_dat <- dat %>%
  select(Q1, Q2, Q4, Q5, Q6,
         Q7_Part_1, Q7_Part_2, Q7_Part_3,
         Q25) %>%
  drop_na(Q25) %>%
  rename(age_group = Q1,
         gender = Q2,
         education = Q4,
         job = Q5,
         experience = Q6,
         salary = Q25,
         python = Q7_Part_1,
         r = Q7_Part_2,
         sql = Q7_Part_3)

set.seed(511)

# - poverty: below the federal minimum wage
# - low: 40,000 to 79,999
# - medium: 80,000 to 124,999
# - high: 125,000 to 199,999
# - very high: 200,000 to 499,999
# - highest: >= 500,000

lm_dat <- lm_dat %>%
  rowwise() %>%
  mutate(salary = case_when(
    salary == "poverty" ~ sample(1:39999, 1),
    salary == "low" ~ sample(40000:79999, 1),
    salary == "medium" ~ sample(80000:124999, 1),
    salary == "high" ~ sample(125000:199999, 1),
    salary == "very high" ~ sample(200000:499999, 1),
    salary == "highest" ~ sample(500000:2000000, 1)
  )) %>%
  ungroup()
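The `rowwise()` + `case_when()` block draws one value uniformly from each respondent's salary band, turning the ordinal categories back into rough numbers a linear model can use; `set.seed(511)` makes the draws reproducible. The same idea in a minimal base-R sketch (three of the six bands, with boundaries as defined in Q2):

```r
set.seed(511)
bands <- list(low    = c(40000, 79999),
              medium = c(80000, 124999),
              high   = c(125000, 199999))
# draw one integer uniformly from a band's [lower, upper] range
draw <- function(band) sample(bands[[band]][1]:bands[[band]][2], 1)
x <- vapply(c("low", "medium", "low"), draw, numeric(1))
x  # three draws, each guaranteed to fall inside its band
```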

ggplot(lm_dat, aes(x=salary)) +
  geom_histogram() +
  labs(
        title = "Histogram of salaries",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))

ggplot(lm_dat, aes(x=log(salary))) +
  geom_histogram() +
  labs(
        title = "Histogram of log(salary)",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))

# 0/1 indicators: the language was listed among regularly used languages (Q7)
lm_dat <- lm_dat %>%
  mutate(python = if_else(is.na(python), 0, 1),
         r = if_else(is.na(r), 0, 1),
         sql = if_else(is.na(sql), 0, 1))
# Full model: all main effects plus language-skill and job interactions;
# stepAIC() then performs stepwise selection to minimize AIC
model1 <- lm(log(salary) ~ . + sql:r + sql:python + r:python + python:r:sql +
               job:sql + job:r + job:python + job:experience +
               job:education, data = lm_dat)
MASS::stepAIC(model1)
model2 <- lm(log(salary) ~ age_group + gender + education + job + experience + 
    python + sql + python:sql, data = lm_dat)

anova(model2)
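`stepAIC()` repeatedly drops (and re-adds) terms, keeping the change that lowers AIC the most, until no single change improves it; `model2` above is the formula that survived this selection on `model1`. A toy run on the built-in `mtcars` data shows the mechanics:

```r
library(MASS)
# full model with four candidate predictors of fuel economy
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
best <- stepAIC(full, trace = FALSE)  # stepwise selection by AIC
formula(best)                         # the AIC-selected subset of predictors
AIC(best) <= AIC(full)                # TRUE: selection never worsens AIC
```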
